Sequential Pattern Discovery under a Markov Assumption
Abstract
In this paper we investigate the general problem of discovering recurrent patterns that are embedded in categorical sequences. An important real-world problem of this nature is motif discovery in DNA sequences. There are a number of fundamental aspects of this data mining problem that can make discovery “easy” or “hard”—we characterize the difficulty of learning in this context using an analysis based on the Bayes error rate under a Markov assumption. The Bayes error framework demonstrates why certain patterns are much harder to discover than others. It also explains the role of different parameters such as pattern length and pattern frequency in sequential discovery. We demonstrate how the Bayes error can be used to calibrate existing discovery algorithms, providing a lower bound on achievable performance. We discuss a number of fundamental issues that characterize sequential pattern discovery in this context, present a variety of empirical results to complement and verify the theoretical analysis, and apply our methodology to real-world motif-discovery problems in computational biology.
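As a rough illustration of why pattern frequency matters, the Bayes error for a two-class decision can be computed directly from the class priors and the per-class symbol distributions. The sketch below is not the paper's analysis: it ignores the Markov dependence between positions and labels a single symbol as "pattern" or "background" in isolation. The function name, the DNA alphabet, and all probabilities are hypothetical values chosen for illustration.

```python
def bayes_error(prior_pattern, p_given_pattern, p_given_background):
    """Bayes error for labeling one observed symbol as pattern vs. background.

    Simplifying assumption (not from the paper): each symbol is classified
    on its own, with no dependence on neighboring positions.
    """
    prior_background = 1.0 - prior_pattern
    symbols = set(p_given_pattern) | set(p_given_background)
    error = 0.0
    for s in symbols:
        joint_pattern = prior_pattern * p_given_pattern.get(s, 0.0)
        joint_background = prior_background * p_given_background.get(s, 0.0)
        # The optimal rule picks the class with the larger joint probability,
        # so the smaller joint probability is the unavoidable error mass.
        error += min(joint_pattern, joint_background)
    return error


# Hypothetical distributions: uniform background, pattern position favoring "A".
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pattern = {"A": 0.9, "C": 0.1 / 3, "G": 0.1 / 3, "T": 0.1 / 3}

rare = bayes_error(0.1, pattern, background)    # pattern symbols are rare
common = bayes_error(0.5, pattern, background)  # pattern symbols are common
print(f"rare pattern: {rare:.3f}, common pattern: {common:.3f}")
```

With these made-up numbers, the Bayes error for the rare pattern equals its prior: the optimal rule labels every symbol as background, so a single symbol carries no usable evidence. That is one simplified view of how low pattern frequency makes discovery hard.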
Similar resources
Generative Modeling of Itemset Sequences Derived from Real Databases
The problem of discovering temporal and attribute dependencies from multi-sets of events derived from real-world databases can be mapped to a sequential pattern mining task. Although generative approaches can offer a critical compact and probabilistic view of sequential patterns, existing contributions are only prepared to deal with sequences with a fixed multivariate order. Thus, this work targ...
A novel grey–fuzzy–Markov and pattern recognition model for industrial accident forecasting
Industrial forecasting is a top-echelon research domain, which has over the past several years experienced highly provocative research discussions. The scope of this research domain continues to expand due to the continuous knowledge ignition motivated by scholars in the area. So, more intelligent and intellectual contributions on current research issues in the accident domain will potentially ...
Does Fundraising Have Meaningful Sequential Patterns? The Case of Fintech Startups
Nowadays, fundraising is one of the most important issues for both Fintech investors and startups. The pattern of fundraising, in terms of the “number and type of rounds and stages needed”, is important. The diverse features and factors that stem from Fintech business models and can influence success are among the key issues in shaping these patterns. This study applied the top 100 KPMG Fintech...
Data Mining for Web Personalization
In this chapter we present an overview of the Web personalization process, viewed as an application of data mining requiring support for all the phases of a typical data mining cycle. These phases include data collection and preprocessing, pattern discovery and evaluation, and finally applying the discovered knowledge in real time to mediate between the user and the Web. This view of the personaliza...
A reservoir-driven non-stationary hidden Markov model
In this work, we propose a novel approach towards sequential data modeling that leverages the strengths of hidden Markov models and echo-state networks (ESNs) in the context of nonparametric Bayesian inference approaches. We introduce a non-stationary hidden Markov model, the time-dependent state transition probabilities of which are driven by a high-dimensional signal that encodes the whole hi...
Publication date: 2002